Introduction to R
for Social Scientists

Workshop Day 1B | 2022-07-25
Jeffrey M. Girard | Pitt Methods

Wrangle I

Tidy Data Principles

  • There are many ways to store data
  • We will be learning the tidy data format
    • Data should be rectangular
    • Each variable has its own column
    • Each observation has its own row
    • Each value has its own cell

Other Data Advice

  • Name all variables in the first row
    • This is called a header row
  • Avoid merged cells for data storage
    • These are okay for communication
  • Avoid empty cells whenever possible
    • Mark missing data as NA
  • Avoid formatting-as-data for storage
    • e.g., non-redundant color-coding

Tidying Example 1

Not Tidy

Name Ann Bob Cat Dom
Age 13 10 11 11
Weight 56.4 46.8 41.3 43.3

❌ Here, each row is a variable and each column is an observation.

Tidy

Name Age Weight
Ann 13 56.4
Bob 10 46.8
Cat 11 41.3
Dom 11 43.3

✔️ Here, each column is a variable and each row is an observation.

Tidying Example 2

Not Tidy

Names: Ann Bob Cat Dom
Age Weight
13 56.4
10 46.8
11 41.3
11 43.3

❌ Here, we have data that is not rectangular because the Names variable has its own row.

Tidy

Name Age Weight
Ann 13 56.4
Bob 10 46.8
Cat 11 41.3
Dom 11 43.3

✔️ Here, we have made the data rectangular by moving the Names variable to its own column.

Tidying Example 3

Not Tidy

country year cases / population
Afghanistan 1999 NA / 19987071
2000 2666 / 20595360
Brazil 1999 37737 / 172006362
2000 80488 / 174504898
China 1999 212258 / 1272915272
2000 213766 / 1280428583

❌ Here, we have merged cells and two values stored in a single cell.

Tidy

country year cases population
Afghanistan 1999 NA 19987071
Afghanistan 2000 2666 20595360
Brazil 1999 37737 172006362
Brazil 2000 80488 174504898
China 1999 212258 1272915272
China 2000 213766 1280428583

✔️ Here, we have un-merged the countries and separated the cases and populations variables into columns.

Tidying Example 4

Not Tidy

student grade
Amber 91.5 A-
Bristol 86.2 B
Charlene 94.0 A
Diego 89.3 B+
Legend: Psych. Major, Psych. Minor

❌ Here, we have a missing variable name and formatting-as-data.

Tidy

student psych grade letter
Amber major 91.5 A-
Bristol minor 86.2 B
Charlene major 94.0 A
Diego NA 89.3 B+

✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.

Tidying Example 5

Not Tidy

student grade letter
Amber 91.5 A-
Bristol* 94.2 A
Class Summary
As 2 Yay!
Bs 0
*Grade was revised.

❌ Here, we have two types of data in one file and a footnote as data.

Tidy

student grade letter revised
Amber 91.5 A- FALSE
Bristol 94.2 A TRUE
letter count notes
A 2 Yay!
B 0

✔️ Here, we have split the data into two separate tables and added the revised and notes variables.

Long vs. Wide Format

Wide Format

date Boeing Amazon Google
2009-01-01 $173.55 $174.90 $174.34
2009-01-02 $172.61 $171.42 $170.04

✔️ Here, we have a wide format where each observation is a date.

Long Format

date stock price
2009-01-01 Boeing $173.55
2009-01-01 Amazon $174.90
2009-01-01 Google $174.34
2009-01-02 Boeing $172.61
2009-01-02 Amazon $171.42
2009-01-02 Google $170.04

✔️ Here, we have a long format where each observation is the combination of a date and a stock.

Tibbles

  • R works particularly well with tidy data
  • We store tidy data in data frames or tibbles
    • Tibbles are just fancier data frames
      (i.e., they have a few extra features)
  • To use tibbles, we need the tidyverse package
  • Tibbles are constructed from one or more vectors
    • The vectors must have the same length
    • They can contain different types of data

Vectors

We start with three separate vector objects that all have the same length.

We set it up so that the \(n\)-th car in each train corresponds to the same observation.

Tibble

Then we combine the vectors into a single tibble (or data frame) object.

Now, as the tibble moves around, the variables always stay together.

Tibbles Live Coding

# SETUP: Install and load the tidyverse package

# Extras pane > Packages tab > Install

library(tidyverse)

# ==============================================================================

# LESSON: Create a tibble from vectors

x <- c(10, 20, 30, 40)
x

y <- x * 2 - 4
y

my_tibble <- tibble(x, y)
my_tibble

# ==============================================================================

# USECASE: You can mix different types of vectors in a single tibble

first_names <- c("Adam", "Billy", "Caitlyn", "Debra")

age_years <- c(12, 13, 10, NA)

guests <- tibble(first_names, age_years)
guests

# ==============================================================================

# TIP: To save time, you can also create the vectors in the tibble call

gradebook <- tibble(
  grade = c(95, 83, 90, 76),
  letter = c("a", "b", "a-", "c")
)
gradebook

# ==============================================================================

# PITFALL: Don't try to combine tibbles with different lengths

y <- c(1, 2, 3)
x <- c("a", "b")

tibble(y, x) #error

# ==============================================================================

# LESSON: However, the exception is R will "recycle" a single value

tibble(y, x = "a")

# ==============================================================================

# LESSON: You can "extract" a vector from a tibble using $

mytibble <- tibble(x = c(1, 2, 3, 4, 5), y = "test")

mytibble$x

mytibble$y

# ==============================================================================

# PITFALL: Don't try to extract a vector that doesn't exist

mytibble$z #error

Importing and Exporting

  • Data is usually stored in data files
    • Importing files into R is called reading
    • Exporting files from R is called writing
  • A convenient data file type is a CSV
    • This stands for comma-separated values
    • A CSV file is easy to share with other people
  • The tidyverse package can read/write CSVs
    • Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)

Read/Write Live Coding

# SETUP: Load the tidyverse package (if you haven't yet)

library(tidyverse)

# ==============================================================================

# USECASE: Create a tibble and write it to a file

gradebook <- tibble(
  id = c(123, 456, 789),
  grade = c("A", "B", "A")
)
gradebook

write_csv(gradebook, file = "gradebook.csv")

# NOTE: You can see the new file in Extras pane > Files tab.
# You can open the file in another program (e.g., Microsoft Excel).
# You can also email this file to someone else to share it.

# ==============================================================================

# PITFALL: Don't swap the order of the tibble and the file

write_csv("gradebook.csv", gradebook) # error

# ==============================================================================

# USECASE: Read in a file containing data

old_gradebook <- read_csv("gradebook.csv")
old_gradebook

# NOTE: read_csv() will examine and guess the data type of each variable.
# You can tell it the data type of each variable, but that is more advanced.

# ==============================================================================

# PITFALL: Don't use the read.csv() and write.csv() functions

old_gradebook <- read.csv("gradebook.csv") # not a tibble
old_gradebook

Wrangle II

Basic wrangling verbs

  • tidyverse provides tools for wrangling tibbles
    • These functions are named after verbs
    • So if you name your objects after nouns
    • …your code becomes easier to read
Noun(noun) ❌ Verb(noun) ✔️
blender(fruit) blend(fruit)
screwdriver(screw) drive(screw)
boxcutter(box) cut(box)

Basic wrangling verbs

  • Primary Functions (most used)
    • select() retains only certain columns
    • mutate() adds or transforms columns
    • filter() retains rows based on criteria
  • Secondary Functions (less used)
    • arrange() sorts rows by their values
    • rename() changes column titles
    • relocate() moves columns around

Select Live Coding

# SETUP: Load package and inspect example tibble

library(tidyverse) # includes the dplyr package
starwars

# ==============================================================================

# USECASE: Retain only the specified variables

sw <- select(starwars, name)
sw
sw <- select(starwars, name, sex, species)
sw

# ==============================================================================

# PITFALL: Don't forget to save the change with assignment

select(starwars, name, sex, species)
starwars # still includes all variables

# ==============================================================================

# USECASE: Change the order of variables

sw <- select(starwars, species, name, sex)
sw

# ==============================================================================

# USECASE: Retain all variables between two variables

sw <- select(starwars, name, hair_color:eye_color)
sw

# ==============================================================================

# USECASE: Retain all variables except the specified ones

sw <- select(starwars, -sex, -species)
sw
sw <- select(starwars, -c(sex, species))
sw
sw <- select(starwars, -c(hair_color:starships))
sw

Rename Live Coding

# USECASE: Change the name of one or more variables

# TEMPLATE: df2 <- rename(df, new_name = old_name)

starwars

sw <- rename(starwars, Character = name)
sw

sw <- rename(starwars, height_cm = height, mass_kg = mass)
sw

# ==============================================================================

# PITFALL: Don't swap the order and try old_name = new_name

sw <- rename(starwars, name = Character) # error

Relocate Live Coding

# USECASE: Move variables before another variable or position

starwars
sw <- relocate(starwars, sex, .before = height)
sw

sw <- relocate(starwars, species, sex, .before = name)
sw

sw <- relocate(starwars, homeworld, .before = 1)
sw

# ==============================================================================

# PITFALL: Don't forget the period!

sw <- relocate(starwars, sex, before = height) 
sw # height was accidentally renamed to before

# ==============================================================================

# USECASE: Move variables after another variable or position

sw <- relocate(starwars, sex, .after = height)
sw

sw <- relocate(starwars, species, sex, .after = name)
sw

sw <- relocate(starwars, homeworld, .after = 1)
sw

Arrange Live Coding

# USECASE: Sort observations by a variable

starwars

sw <- arrange(starwars, height)
sw # sorted by height, ascending

sw <- arrange(starwars, name)
sw # sorted by name, alphabetically

# ==============================================================================

# USECASE: Sort observations by a variable, in reverse order

sw <- arrange(starwars, desc(height))
sw # sorted by height, descending

sw <- arrange(starwars, desc(name))
sw # sorted by name, reverse-alphabetically

# ==============================================================================

# USECASE: Sort observations by multiple variables

sw <- arrange(starwars, hair_color, mass)
sw # sorted by hair_color, then ties broken by mass

Filter Live Coding

# USECASE: Retain only observations that meet a criterion

sw <- filter(starwars, mass > 100)
sw # only observations with mass greater than 100

sw <- filter(starwars, mass <= 100)
sw # only observations with mass less than or equal to 100

sw <- filter(starwars, species == "Human")
sw # only observations with species equal to Human

sw <- filter(starwars, species != "Human")
sw # only observations with species not equal to Human

# ==============================================================================

# PITFALL: Don't try to use a single = for testing equality

sw <- filter(starwars, height = 150) # error

sw <- filter(starwars, height == 150) # correct
sw 

# ==============================================================================

# PITFALL: Don't forget that R is case-sensitive

sw <- filter(starwars, species == "human")
sw # no observations left (because it should be Human)

# ==============================================================================

# USECASE: Retain only observations that meet complex criteria

sw <- filter(starwars, mass > 100 & height > 200)
sw # only observations with mass over 100 AND height over 200

sw <- filter(starwars, height < 100 | hair_color == "none")
sw # only observations with height under 100 OR hair_color equal to none

# ==============================================================================

# PITFALL: Don't forget to complete both conditions

sw <- filter(starwars, mass > 100 & < 200) # error

sw <- filter(starwars, mass > 100 & mass < 200) # correct
sw

# ==============================================================================

# PITFALL: Don't try to equate a string to a vector

sw <- filter(starwars, species == c("Human", "Droid")) # error

sw <- filter(starwars, species %in% c("Human", "Droid")) # correct
sw

Filter Cheatsheet

Symbol Description Num Chr
< Less than Yes No
<= Less than or equal to Yes No
> More than Yes No
>= More than or equal to Yes No
== Equal to Yes Yes
!= Not equal to Yes Yes
%in% Found in Yes Yes
& Logical And Yes Yes
| Logical Or Yes Yes

Mutate Live Coding

# SETUP: Create example tibble

patients <- tibble(
  id = c("S1", "S2", "S3"),
  feet = c(6, 5, 5),
  inches = c(1, 7, 2),
  pounds = c(176.3, 124.9, 162.6)
)
patients

# ==============================================================================

# USECASE: Add one or more variables

p2 <- mutate(patients, sex = c("M", "F", "F"))
p2

ages <- c(32, 41, 29)
p2 <- mutate(patients, ages = ages)
p2

p2 <- mutate(
  patients, 
  sex = c("M", "F", "F"), 
  ages = ages
)
p2

# ==============================================================================

# USECASE: Compute variables

p2 <- mutate(patients, height = feet + inches / 12)
p2

p2 <- mutate(patients, ln_pounds = log(pounds))
p2

# ==============================================================================

# USECASE: Overwrite variables

patients <- mutate(patients, height = height / 3.281)
patients

# ==============================================================================

# USECASE: Duplicate variables

p2 <- mutate(patients, weight = pounds)
p2 # both weight and pounds exist

Pipes and Pipelines

  • How can we do multiple operations to an object?
    1. x <- 10
    2. x2 <- sqrt(x)
    3. x3 <- round(x2)
  • This works but is cumbersome and error-prone
  • A better approach is to use pipes and pipelines
    • x3 <- 10 |> sqrt() |> round()
  • I like to read |> as “and then…”
    • “Take 10 and then sqrt() and then round()”

Pipes Live Coding

# SETUP: Enable the pipe operator shortcut

# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"

# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)

# ==============================================================================

# LESSON: The pipe pushes objects to a function as its first argument

# TEMPLATE: x |> function_name() is the same as function_name(x)

x <- 10

y <- sqrt(x)
y

y <- x |> sqrt()
y

# ==============================================================================

# PITFALL: Don't forget to remove the object from the function call

x |> sqrt(x) # wrong

x |> sqrt() # correct

# ==============================================================================

# USECASE: You can still use arguments when piping

z <- round(3.14, digits = 1)
z

z <- 3.14 |> round(digits = 1)
z

# ==============================================================================

# USECASE: Pipes are useful with tibbles and wrangling verbs

starwars

sw <- select(starwars, name, species, height)
sw

sw <- starwars |> select(name, species, height)
sw

# ==============================================================================

# PITFALL: Don't add a pipe without a step after it

sw <- starwars |> select(name, species, height) |> # error

Pipelines Live Coding

# USECASE: You can chain multiple pipes together to make a pipeline

x <- 10 |> sqrt() |> round()
x

# ==============================================================================

# TIP: If you want to see the output of a pipeline, you can pipe to print()

x <- 10 |> sqrt() |> round() |> print()

# ==============================================================================

# TIP: To make your pipelines more readable, move each step to a new line

x <- 
  10 |> 
  sqrt() |> 
  round() |>
  print()

# ==============================================================================

# PITFALL: Don't put the pipe at the beginning of a line, though

x <- 
  10 
  |> sqrt()
  |> round()
  |> print() # error

# ==============================================================================

# USECASE: Chain together a series of verbs to flexibly wrangle data

tallones <- 
  starwars |> 
  select(name, species, height) |> 
  rename(height_cm = height) |> 
  mutate(height_ft = height_cm / 30.48) |>  
  filter(height_ft > 7) |> 
  arrange(desc(height_ft)) |>  
  print()

Practice II